Dynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce

Authors

  • W. P. McNeill
  • Hakan Kardes
  • Andrew Borthwick
Abstract

Record Linkage is the task of identifying which records in a database refer to the same entity. A standard machine learning approach to this problem is to train a model that assigns scores to pairs of records where pairs scoring above a threshold are said to represent the same entity. However, it is too expensive to make pairwise comparisons among all records in large databases. “Blocking” is the process of grouping similar-seeming records into blocks that a machine learning component then explores exhaustively. In many blocking approaches, records are grouped together into blocks by shared properties that are indicators of duplication. However, when dealing with very large data sources, it is nearly impossible to determine any fixed set of properties at training time that will be optimal for the Zipfian distribution of values for these properties that we will encounter at run time. In this paper, we propose a novel Dynamic Blocking algorithm which automatically chooses the blocking properties in a data-driven way at execution time to efficiently determine which pairs of records in a data set should be examined as potential duplicates without creating the same pair across blocks. We demonstrate the viability of this algorithm for large data sets. We have scaled this system up to work on billions of records on an 80-node Hadoop cluster.
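The abstract does not spell out how blocking properties are chosen at run time, so the following is only a minimal sketch of one way the idea could work. It assumes that any block larger than a cap is subdivided by intersecting it with a further property, and that candidate pairs are collected in a set so a pair shared by several blocks is emitted once. The cap `max_block_size`, the property names, the sample records, and the rule of discarding blocks that cannot be split further are all illustrative assumptions, not details from the paper.

```python
from itertools import combinations


def group_by(records, ids, prop, key_prefix):
    """Group the given record ids by their value for one property."""
    groups = {}
    for rid in ids:
        value = records[rid].get(prop)
        if value is not None:
            groups.setdefault(key_prefix + ((prop, value),), []).append(rid)
    return groups


def dynamic_blocks(records, properties, max_block_size):
    """Build blocks from shared property values, refining oversize blocks.

    Assumption: an oversize block is split by each later property in
    `properties`; an oversize block with no property left is discarded.
    """
    accepted = []
    queue = []
    # Start with one block per (property, value) pair.
    for i, prop in enumerate(properties):
        top_level = group_by(records, records.keys(), prop, ())
        queue.extend((key, ids, i) for key, ids in top_level.items())

    while queue:
        key, ids, last = queue.pop()
        if len(ids) < 2:
            continue  # a singleton block yields no pairs
        if len(ids) <= max_block_size:
            accepted.append((key, sorted(ids)))
            continue
        # Oversize block: intersect it with each remaining property.
        for j in range(last + 1, len(properties)):
            sub = group_by(records, ids, properties[j], key)
            queue.extend((k, g, j) for k, g in sub.items())
    return accepted


def candidate_pairs(blocks):
    """Enumerate each pair once, even if it occurs in several blocks."""
    pairs = set()
    for _key, ids in blocks:
        pairs.update(combinations(ids, 2))
    return pairs


# Hypothetical sample data.
records = {
    1: {"first": "john", "last": "smith", "zip": "98101"},
    2: {"first": "john", "last": "smith", "zip": "98101"},
    3: {"first": "john", "last": "doe", "zip": "10001"},
    4: {"first": "mary", "last": "smith", "zip": "98101"},
}
blocks = dynamic_blocks(records, ["first", "last", "zip"], max_block_size=2)
pairs = candidate_pairs(blocks)
```

In this sketch, only the pairs returned by `candidate_pairs` would be passed to the scoring model described in the abstract. A common value such as `first = john` produces an oversize block and is refined, while a rare value would be accepted as a block on its own, which is the kind of data-driven choice a fixed property set cannot make.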

Related resources

Sorted Nearest Neighborhood Clustering for Efficient Private Blocking

Record linkage is an emerging research area which is required by various real-world applications to identify which records in different data sources refer to the same real-world entities. Often privacy concerns and restrictions prevent the use of traditional record linkage applications across different organizations. Linking records in situations where no private or confidential information can...

Towards Parameter-free Blocking for Scalable Record Linkage

Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. ...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

MPI for Big Data: New tricks for an old dog

The processing of massive amounts of data on clusters with a finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we pres...

Parallel Sorted Neighborhood Blocking with MapReduce

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce j...

Publication date: 2012